
Conversation

Contributor

@pockers21 pockers21 commented Oct 14, 2025

Update Notes (2025‑11‑6)

  • CLI Merge
    • Fold the standalone Jina CLI into mtmd-cli’s projector‑only flow; remove the extra binary.
  • Conversion Script (set_gguf_parameters)
    • Emit vision keys using the standard naming: clip.has_vision_encoder, clip.vision.image_size/patch_size/embedding_length/
      block_count/projection_dim/feed_forward_length/attention.head_count.
    • Write only projector_type (set to 'jinaclip2'); do not introduce projector_version.
  • Inference (mtmd)
    • Use ggml_rope_ext to implement 2D RoPE; reuse bicubic for image preprocessing.
  • Minimal Validation
    • Conversion succeeds; gguf_dump shows clip.projector_type='jinaclip2' (a scripted check follows below).
    • Minimal inference passes for both text and image; C++ vs Python cosine/RMSE are within the expected range.
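
As a scripted version of the gguf_dump check, here is a minimal sketch using gguf-py's GGUFReader (the path is illustrative; the parts/data indexing follows gguf-py's reader layout for string fields — treat it as an assumption against the in-tree version):

from gguf import GGUFReader

reader = GGUFReader("/path/mmproj-jina-vision-converted.gguf")
field = reader.fields["clip.projector_type"]
# string fields store the index of their raw-bytes part in field.data
value = bytes(field.parts[field.data[0]]).decode("utf-8")
assert value == "jinaclip2", value
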
Reproduction

Minimal commands & data (CPU)
  • Produce GGUF (with ST pooling metadata)
    • Text: jina-bert-v3.pooling_type = MEAN/CLS/LAST
    • Vision: clip.projector_type = jinaclip2, clip.vision.rope_theta = 10000 (default)
  • Text parity
    • C++: CUDA_VISIBLE_DEVICES= ./build/bin/llama-embedding -m /path/jina-text-converted.gguf -p "hello world" --n-gpu-layers 0 --pooling mean --embd-normalize 2 --embd-output-format array
    • Python: python3 <ref>/debug.py --mode text --input "hello world" --out-dir <dir> --fa off
    • Metric: read both 512-d outputs and compute cosine / RMSE (see the sketch after this list)
  • Image parity
    • C++: CUDA_VISIBLE_DEVICES= ./build/bin/llama-mtmd-cli --mmproj /path/mmproj-jina-vision-converted.gguf --image /path/img.jpg --n-gpu-layers 0 --embd-normalize 2 --embd-output-format array
    • Python: python3 <ref>/debug.py --mode image --input /path/img.jpg --out-dir <dir> --fa off
    • Metric: read both 512-d outputs and compute cosine / RMSE
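
A minimal sketch of the parity metric, assuming both sides dump their 512-d embeddings as whitespace-separated floats (file names and format are illustrative, not prescribed by the PR):

import numpy as np

a = np.loadtxt("cpp_embedding.txt")  # C++ output, 512 floats
b = np.loadtxt("py_embedding.txt")   # Python reference, 512 floats

cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
rmse   = float(np.sqrt(np.mean((a - b) ** 2)))
print(f"cosine={cosine:.6f} rmse={rmse:.6f}")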

mtmd: Add JinaCLIP v2 vision projector + GGUF support for jina-bert-v3 (merged-LoRA or adapter)

Overview

  • Converter: write jina-bert-v3 text tower params into GGUF (supports both merged-LoRA checkpoints and adapter-based inputs), and export vision metadata (projector_type=jinaclip, vision.rope_theta, image_size, patch_size, projection_dim, etc.).
  • Runtime: introduce PROJECTOR_TYPE_JINACLIP in the MTMD path (JinaCLIP v2 vision tower: 2D RoPE with shared frequency cache, attention/FFN internal LayerNorm, single-token output), and normalize with common_embd_normalize(..., 2) (a plain-numpy equivalent of the normalization is sketched after this list).
  • CLI (core): add a minimal validation tool llama-jinaclip-cli (built by default) for text/image embedding numerical/performance checks; it depends only on common+mtmd+Threads, builds cross-platform, and has no third-party deps.
  • Compatibility: only activates when related GGUF metadata exists; doesn’t affect other projectors (e.g., LLaVA/Qwen2VL); no ggml op changes; no external dependencies.
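
For reference, the normalization used throughout (--embd-normalize 2, i.e. Euclidean norm) amounts to the following; this is a plain-numpy equivalent, not llama.cpp's common_embd_normalize itself:

import numpy as np

def l2_normalize(v: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    # divide by the Euclidean norm; eps guards against a zero vector
    return v / max(float(np.linalg.norm(v)), eps)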

Scope of changes

  • convert_hf_to_gguf.py
    • Text: support both merged-LoRA single checkpoints and adapter-based export.
    • Vision (JinaCLIP v2): export clip.projector_type=jinaclip, clip.vision.rope_theta (configurable), image_size/patch_size/projection_dim, and map tensors for fused/non-fused QKV (a fused-QKV split is sketched after this list).
  • tools/mtmd/clip.cpp, tools/mtmd/clip-impl.h
    • Add PROJECTOR_TYPE_JINACLIP: JinaCLIP v2 vision tower (2D RoPE with shared freq cache), attention internal LN, FFN sub-layer LN (enabled when both weight/bias present), single-token output (CLS-equivalent), unified L2 normalize.
    • clip_n_output_tokens() returns 1 for JinaCLIP; clip_n_mmproj_embd() returns projection_dim.
  • tools/mtmd/jinaclip-cli.cpp, tools/mtmd/CMakeLists.txt
    • Add llama-jinaclip-cli target (default); one command covers text/image minimal validation, thread scaling, encode_ms reporting, and saves embeddings for Python parity.
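
To illustrate the fused/non-fused QKV mapping mentioned above, a conversion-time split could look like this (tensor names and the helper are hypothetical, not the PR's actual mapping):

import torch

def split_fused_qkv(name: str, w: torch.Tensor, n_embd: int):
    # a fused weight has shape [3*n_embd, n_embd]; emit separate q/k/v tensors
    if name.endswith("attn.qkv.weight"):
        q, k, v = w.split(n_embd, dim=0)
        base = name.removesuffix("qkv.weight")
        return [(base + "q.weight", q), (base + "k.weight", k), (base + "v.weight", v)]
    return [(name, w)]  # non-fused tensors pass through unchanged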

Validation summary

  • CI: CPU-only ci/run.sh passes locally; no ggml op changes in this PR.
  • Correctness: perplexity is not applicable to embedding models, so we verify via C++ vs Python parity instead.
    • TEXT (CPU, minimal sample): cosine=0.999996, RMSE=0.000125
    • IMAGE (CPU, minimal sample): cosine=0.990261, RMSE=0.006168
  • Performance: checked with CLI encode_ms and thread scaling; no regression observed. More data can be added if requested.
  • Compatibility: activated only when GGUF metadata (projector_type=jinaclip, etc.) is present; other projectors unaffected.
  • Reference: ModelScope uniontech-yourong/split_jina (used for Python-side parity).

Performance (absolute metrics, CPU-only minimal samples)

  • Environment
    • OS: Ubuntu 22.04.5 LTS
    • CPU: Intel Xeon Platinum 8352V (dual-socket, 2×32C/64T, SMT on), 128 threads total
    • Build: Release, GGML_CUDA=OFF (CPU-only), GCC 11.4, CMake 3.22
    • Model: JinaCLIP v2 vision tower (image_size=512, patch=14, depth=24, hidden=1024; official: https://huggingface.co/jinaai/jina-clip-v2); text tower (Jina Embeddings v3, output truncated to 512 dims)
    • Threads: primarily 8 threads for both text/image (with 1-thread comparison)
  • Metric definitions
    • Text: use CLI-reported JINACLIP_ENCODE_MS (pure inference, excludes load)
    • Image: use CLI line “image … done in … ms” (pure inference, excludes load)
  • Results (single sample, minimal)
    • Text (“hello world”, ≈5 tokens)
      • 1 thread: encode_ms ≈ 180.48 ms
      • 8 threads: encode_ms ≈ 34.08 ms
    • Image (512×512, single)
      • 8 threads: image done in ≈ 6154 ms (stabilizes ~6.1–6.4 s after warm-up)
  • Notes
    • Above numbers are CPU-only pure inference; end-to-end (including model load) is higher and not included.

GPU group (absolute metrics, minimal samples)

  • Environment
    • GPU: NVIDIA vGPU-32GB (cc=8.9, 32 GB), Driver 550.107, CUDA 12.4
    • Build: Release, GGML_CUDA=ON (CUDA backend), CUDA arch=89
    • Threads: -t 8 (host-side preprocessing threads)
  • Results (pure inference, excludes load)
    • Text (“hello world”, ≈5 tokens): encode_ms ≈ 84.88 ms
    • Image (512×512, single): image done in ≈ 827 ms

@pockers21 pockers21 requested review from CISC and ngxson as code owners October 14, 2025 09:04
@github-actions github-actions bot added the examples and python (python script changes) labels Oct 14, 2025
Collaborator

@ngxson ngxson left a comment


add a minimal validation tool llama-jinaclip-cli (built by default) for text/image embedding numerical/performance checks;

I don't see why we need to add this new CLI. The mtmd-cli can do this with the -p and --image params.

Comment on lines 63 to 67

# JinaCLIP CLI (align style with other targets above)
set(TARGET llama-jinaclip-cli)
add_executable (${TARGET} jinaclip-cli.cpp)
target_link_libraries (${TARGET} PRIVATE common mtmd Threads::Threads)
Collaborator


we should try to merge this with mtmd-cli to avoid the "fragmentation" trap of the old llava-cli binary

Contributor Author

@pockers21 pockers21 Oct 15, 2025


Agree to merge into llama-mtmd-cli to avoid adding another standalone CLI.
However, Jina supports text and vision embedding with separate models, which differs from the existing mtmd-cli workflow that requires the text and vision models at the same time.
I will add a Jina-specific path in mtmd-cli that supports running with only --mmproj + --image.

Comment on lines 6306 to 6312
self.gguf_writer.add_uint32("clip.vision.image_size", img_sz)
self.gguf_writer.add_uint32("clip.vision.patch_size", patch_sz)
self.gguf_writer.add_uint32("clip.vision.embedding_length", n_embd)
self.gguf_writer.add_uint32("clip.vision.block_count", n_layer)
self.gguf_writer.add_uint32("clip.vision.projection_dim", proj_dim)
self.gguf_writer.add_uint32("clip.vision.feed_forward_length", n_ff)
self.gguf_writer.add_uint32("clip.vision.attention.head_count", n_head)
Collaborator


We had specific functions and constants to add these metadata keys. Use them instead
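
For example, the block quoted above could use GGUFWriter's dedicated vision helpers rather than hand-written key strings (method names follow gguf-py's add_vision_* pattern; treat the exact names as assumptions against the in-tree version):

self.gguf_writer.add_vision_image_size(img_sz)
self.gguf_writer.add_vision_patch_size(patch_sz)
self.gguf_writer.add_vision_embedding_length(n_embd)
self.gguf_writer.add_vision_block_count(n_layer)
self.gguf_writer.add_vision_projection_dim(proj_dim)
self.gguf_writer.add_vision_feed_forward_length(n_ff)
self.gguf_writer.add_vision_head_count(n_head)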


# Top-level direct mappings
if src_no_vm == 'cls_token':
    return [('v.cls_token', data_torch)]
Collaborator


Use proper mapping instead

Comment on lines 2229 to 2237
if (!ctx->jinaclip_rope_initialized) {
    const int half_dim = rope_dim / 2;
    std::vector<float> base_freqs(half_dim);
    for (int i = 0; i < half_dim; i++) {
        float arange_val    = i * 2.0f;                     // [0, 2, 4, ..., 30]
        float normalized    = arange_val / rope_dim;        // [0, 2/32, 4/32, ..., 30/32]
        float theta_powered = powf(freq_base, normalized);  // theta^normalized
        base_freqs[i] = 1.0f / theta_powered;               // 1.0 / theta^normalized
    }
Collaborator


Not sure what you're trying to do here, is this just 2D RoPE? (which we already supported)

Contributor Author


This isn’t re‑implementing generic 2D RoPE; it implements JinaCLIP’s VisionRotaryEmbeddingFast.
It uses fractional‑position 2D RoPE (t = arange(ft)/ft * pt) and precomputes a full H×W cos/sin grid; the official 2D RoPE uses integer grid positions (pos_h/pos_w) with ggml_rope_ext and does not include these steps.
This is done to strictly match Jina’s Python semantics.
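
A rough numpy restatement of that precompute step (simplified from the description above, not Jina's actual code; ft is the feature-grid length, pt the pretrained grid length):

import numpy as np

def rope2d_grid(ft: int, pt: int, rope_dim: int, theta: float = 10000.0):
    base = 1.0 / theta ** (np.arange(0, rope_dim, 2) / rope_dim)  # (rope_dim/2,)
    t = np.arange(ft) / ft * pt                           # fractional positions
    freqs = np.outer(t, base)                             # per-axis angles
    freqs_h = np.repeat(freqs[:, None, :], ft, axis=1)    # broadcast over W
    freqs_w = np.repeat(freqs[None, :, :], ft, axis=0)    # broadcast over H
    grid = np.concatenate([freqs_h, freqs_w], axis=-1)    # (ft, ft, rope_dim)
    return np.cos(grid), np.sin(grid)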

Collaborator

@ngxson ngxson Oct 15, 2025


fractional‑position 2D RoPE (t = arange(ft)/ft * pt)

Based on your code:

time_seq[i] = (float) i / ft_seq_len * pt_seq_len;  // [0, 16/36, 32/36, ..., 560/36]
...
freqs_h[t * half_dim + f] = time_seq[t] * base_freqs[f];

Then why don't we scale base_freqs[f] instead? The third param of ggml_rope_ext, the c tensor (freq_scale) is made for this purpose.

Honestly I think this is just YaRN
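
The reviewer's point can be checked numerically: fractional positions with unscaled base frequencies produce exactly the same angles as integer grid positions with the base frequencies scaled by pt/ft, so the scaling can live in the frequency factors passed to ggml_rope_ext (the values below mirror the comments in the snippets above):

import numpy as np

ft, pt, rope_dim, theta = 36, 16, 32, 10000.0
base = 1.0 / theta ** (np.arange(0, rope_dim, 2) / rope_dim)

angles_frac = np.outer(np.arange(ft) / ft * pt, base)   # PR's precomputed path
angles_int  = np.outer(np.arange(ft), base * (pt / ft)) # scale folded into freqs

assert np.allclose(angles_frac, angles_int)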

}

clip_image_u8 resized_keep_ratio;
image_manipulation::bicubic_pil_resize(*img, resized_keep_ratio, out_w, out_h);
Collaborator


Generally pre-processing doesn't need to be byte-exact. I would prefer keeping the old bicubic_resize to keep it simple.

Comment on lines 5029 to 5415
self.model_arch = gguf.MODEL_ARCH.JINA_BERT_V3

# Jina v3 (RoPE) without LoRA should export as jina-bert-v3 to avoid expecting absolute position embeddings
try:
    text_cfg = hparams.get("text_config", {}) if isinstance(hparams.get("text_config", {}), dict) else {}
    pe_type = (text_cfg.get("position_embedding_type") or hparams.get("position_embedding_type") or "").lower()
    rope_base = text_cfg.get("rotary_emb_base", hparams.get("rotary_emb_base"))
    name_path = (hparams.get("_name_or_path") or "").lower()
    is_v3 = (pe_type == "rotary" or rope_base is not None) and ("jina" in name_path and "v3" in name_path)
    if is_v3 and not self._lora_names:
        self.model_arch = gguf.MODEL_ARCH.JINA_BERT_V3
Collaborator


Please explain this: first off, it breaks jina-embeddings-v3 conversion; secondly, jina-clip-v2 looks like it loads jina-embeddings-v3 and uses the retrieval.query LoRA/prompt, but load_trained_adapters set to false suggests it's not applied?
https://huggingface.co/jinaai/jina-clip-v2/blob/main/config.json#L15-L38

pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Oct 21, 2025
…_rope_ext; clean dead code/logs

- mtmd-cli: remove file output option and writing; keep --embd-output-format for stdout; unify image load via mtmd-helper; style fixes
- clip.cpp: unify 2D RoPE via ggml_rope_ext using per-dim c tensors (c_first=1/s, c_second=1/(s*odd)); remove precompute path; drop unused bicubic helpers; silence unused param warnings
- mtmd-helper: add noctx bitmap loader used by projector-only path
- convert_hf_to_gguf.py: robust JINA_BERT_V3 detection ((is_v3) or LoRA); remove debug prints; keep upstream tokenizer sample intact
- cleanup: remove debug artifacts (status md, analyze scripts), align with PR ggml-org#16574 guidance
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Oct 22, 2025
…fix converter keys, rope_ext + bicubic

mtmd-cli: move the standalone Jina CLI into mtmd-cli (projector-only path); drop the extra binary.
@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch from fd37a5c to 9d02918 Compare October 22, 2025 08:39
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Oct 22, 2025
…fix converter keys, rope_ext + bicubic

mtmd-cli: move the standalone Jina CLI into mtmd-cli (projector-only path); drop the extra binary.
@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch from 9d02918 to e19eb27 Compare October 22, 2025 10:35
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Oct 22, 2025
…fix converter keys, rope_ext + bicubic

mtmd-cli: move the standalone Jina CLI into mtmd-cli (projector-only path); drop the extra binary.
@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch from e19eb27 to 2d8885b Compare October 22, 2025 10:36
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Oct 22, 2025
…fix converter keys, rope_ext + bicubic

mtmd-cli: move the standalone Jina CLI into mtmd-cli (projector-only path); drop the extra binary.
@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch from 2d8885b to b9f78de Compare October 22, 2025 11:07
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Oct 23, 2025
…fix converter keys, rope_ext + bicubic

mtmd-cli: move the standalone Jina CLI into mtmd-cli (projector-only path); drop the extra binary.
@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch from b9f78de to 2787888 Compare October 23, 2025 02:07
@pockers21 pockers21 marked this pull request as draft October 24, 2025 05:45
@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch 2 times, most recently from 46f9ee2 to 542ed6a Compare October 28, 2025 03:17
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Oct 28, 2025
…fix converter keys, rope_ext + bicubic

mtmd-cli: move the standalone Jina CLI into mtmd-cli (projector-only path); drop the extra binary.
@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch 4 times, most recently from 445e0d5 to bd46020 Compare October 28, 2025 10:02
Collaborator

CISC commented Oct 28, 2025

@pockers21 What's up?

Contributor Author

pockers21 commented Oct 29, 2025

@pockers21 What's up?

I'm currently adjusting the code and fixing issues. I had originally planned to answer your questions when moving the PR from draft to ready for review, but let me explain now. The link you shared (https://huggingface.co/jinaai/jina-clip-v2/blob/main/config.json#L15-L38) points to the official Jina config that includes LoRA. In our work, we modified the official Jina model to fuse the text-side LoRA into the base model and then exported it to GGUF. Under Jina's logic, those fields don't take effect when loading jina-clip-v2; they are only triggered when loading the embeddings-v3 model.

@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch 2 times, most recently from 7e0b15b to 2338880 Compare October 29, 2025 08:43
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Nov 4, 2025
…icubic;switch to 'jinaclip2'; fix converter constants
@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch from ebe1377 to 87431ef Compare November 4, 2025 02:18
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Nov 4, 2025
…icubic;switch to 'jinaclip2'; fix converter constants
@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch from 87431ef to d6a9167 Compare November 4, 2025 02:30
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Nov 4, 2025
…icubic;switch to 'jinaclip2'; fix converter constants
@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch from d6a9167 to ee7aace Compare November 4, 2025 02:42
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Nov 4, 2025
…icubic;switch to 'jinaclip2'; fix converter constants
@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch from ee7aace to 84f4c5c Compare November 4, 2025 03:04
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Nov 4, 2025
…icubic;switch to 'jinaclip2'; fix converter constants
@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch from 84f4c5c to 3e4f6ed Compare November 4, 2025 03:28
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Nov 4, 2025
…icubic;switch to 'jinaclip2'; fix converter constants
@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch from 3e4f6ed to 2f94dcc Compare November 4, 2025 03:42
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Nov 4, 2025
…icubic;switch to 'jinaclip2'; fix converter constants
@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch from 2f94dcc to 6ce3d1c Compare November 4, 2025 06:09
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Nov 4, 2025
…icubic;switch to 'jinaclip2'; fix converter constants
@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch from 6ce3d1c to bab814d Compare November 4, 2025 07:00
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Nov 5, 2025
…icubic;switch to 'jinaclip2'; fix converter constants
@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch from bab814d to cd5ad71 Compare November 5, 2025 10:12
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Nov 5, 2025
…icubic;switch to 'jinaclip2'; fix converter constants
@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch from cd5ad71 to b00c720 Compare November 5, 2025 10:18
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Nov 5, 2025
…icubic;switch to 'jinaclip2'; fix converter constants
@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch from b00c720 to c2bfa3d Compare November 6, 2025 01:43
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Nov 6, 2025
…icubic;switch to 'jinaclip2'; fix converter constants
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Nov 6, 2025
…icubic;switch to 'jinaclip2'; fix converter constants
pockers21 pushed a commit to pockers21/llama.cpp that referenced this pull request Nov 6, 2025
…icubic;switch to 'jinaclip2'; fix converter constants
@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch from c2bfa3d to 1c5c964 Compare November 6, 2025 06:04
@pockers21 pockers21 force-pushed the feature/jinaclip-v2-projector branch from 1c5c964 to 0eeb6fc Compare November 7, 2025 01:41
@pockers21 pockers21 marked this pull request as ready for review November 7, 2025 07:12
@pockers21 pockers21 requested a review from ggerganov as a code owner November 7, 2025 07:12